Changed regex for calculation of percent hemoglobin genes #229
Dear @KriBaLin, thank you!
So what do you think?
Dear @Zethson, thank you for your fast reply. Sorry, there was a formatting mistake in my first post that turned the suggested Regarding the expression being too specific, I'm honestly not experienced enough to judge this with respect to future changes of gene annotations or the like. Currently, when I search the 36601 genes of my human data set for genes starting with "HB", I get 13 hits: HBEGF, HBS1L, HBP1, HBB, HBD, HBG1, HBG2, HBE1, HBZ, HBM, HBA2, HBA1, HBQ1. The first 3 don't seem to be hemoglobin genes. The regex A (maybe more robust?) option could be to explicitly check for a list of hemoglobin genes, as suggested by Konrad Rudolph on Stack Overflow.
Guess one could look at Ensembl gene symbols to see how this regex would affect them. A list of genes is also possible, but then we'd need the list ^_^
I agree with @klmr that an explicit list is preferable over a regex. Not sure what a "trusted source" of hemoglobin genes would be, but results 1-10 from this GeneCards search are probably a good start. At least the list certainly doesn't include anything unexpected.
Thank you very much @grst for the link! We'll make the changes accordingly using the list.
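For readers who want to try the explicit-list approach discussed above, here is a minimal sketch. The ten symbols are simply the hemoglobin genes named earlier in this thread (the 13 "HB" hits minus HBEGF, HBS1L, and HBP1), not a vetted reference list, and the names `HEMOGLOBIN_GENES`, `hemoglobin_mask`, and `percent_hemoglobin` are hypothetical helpers rather than anything from the notebook.

```python
# Assumption: human hemoglobin gene symbols taken from the 13 "HB" hits
# listed in this thread, with HBEGF, HBS1L and HBP1 removed.
HEMOGLOBIN_GENES = frozenset(
    {"HBB", "HBD", "HBG1", "HBG2", "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1"}
)


def hemoglobin_mask(gene_names):
    """Return one boolean per gene: True if the symbol is in the explicit list."""
    return [name in HEMOGLOBIN_GENES for name in gene_names]


def percent_hemoglobin(counts_by_gene):
    """Percentage of total counts that come from listed hemoglobin genes.

    `counts_by_gene` maps gene symbol -> count for a single cell (hypothetical
    input format chosen for this sketch).
    """
    total = sum(counts_by_gene.values())
    if total == 0:
        return 0.0
    hb = sum(c for gene, c in counts_by_gene.items() if gene in HEMOGLOBIN_GENES)
    return 100.0 * hb / total


if __name__ == "__main__":
    cell = {"HBB": 30, "HBEGF": 10, "ACTB": 60}
    print(hemoglobin_mask(cell))      # membership per gene symbol
    print(percent_hemoglobin(cell))   # share of counts from listed genes
```

Because the check is plain set membership, look-alike symbols such as HBEGF can never slip in; the trade-off is that the list has to be maintained by hand if annotations change.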
I found this helpful, but would like to note for anyone else looking around for this that the pseudogenes in the mouse genome are denoted with a -p (e.g., Hba-ps4). I went with I am planning to use
Dear Theis lab,
thank you a lot for your very helpful book and tutorials.
I am currently performing my first analysis of scRNAseq data. During step 6.3 (filtering low quality reads) I wanted to understand the regex for filtering hemoglobin genes ("^HB[^(P)]").
I noticed that this regex not only includes hemoglobin-genes, but also the genes HBEGF, HBS1L, and HBP1.
I was trying to find a more specific regex that matches only the hemoglobin genes, with some help from Stack Overflow. I'd suggest `"^HB(?!EGF|S1L|P1).+"`, which I changed in the Jupyter notebook; an alternative might be `"^HB[^(P|S)]($|[^G])"`.
This applies to human data; however, we briefly confirmed that these regexes are applicable (with lowercase characters) to mouse data, too.
Please correct me if I am wrong and the original regex performs in the way intended by you. In this case, I would suggest extending the documentation for clarification.
Best,
Kristina
edit: added code backticks to the suggested regexes for correct display
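To make the comparison in this thread easy to reproduce, the following sketch runs the original pattern and the two proposed patterns against the 13 human gene symbols starting with "HB" that were reported above. It uses only Python's standard `re` module; the helper name `matching_genes` and the dictionary keys are hypothetical, and how the notebook itself applies the pattern is not shown here.

```python
import re

# The 13 human symbols starting with "HB" reported in this thread.
GENES = [
    "HBEGF", "HBS1L", "HBP1", "HBB", "HBD", "HBG1", "HBG2",
    "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1",
]

# Patterns quoted in the thread (keys are labels made up for this sketch).
PATTERNS = {
    "original": r"^HB[^(P)]",
    "lookahead": r"^HB(?!EGF|S1L|P1).+",
    "alternative": r"^HB[^(P|S)]($|[^G])",
}


def matching_genes(pattern, genes):
    """Return the gene symbols in which the pattern is found."""
    rx = re.compile(pattern)
    return [gene for gene in genes if rx.search(gene)]


results = {name: matching_genes(p, GENES) for name, p in PATTERNS.items()}

if __name__ == "__main__":
    for name, hits in results.items():
        print(f"{name}: {len(hits)} hits -> {', '.join(hits)}")
```

Run as written, Python's `re` reports that the original pattern already skips HBP1 (its character class excludes `P`) but still matches HBEGF and HBS1L, while both proposed patterns keep only the ten hemoglobin genes from the list. Note that inside `[...]` the characters `(`, `)`, and `|` are literals, so `[^(P|S)]` behaves the same as `[^PS()|]`.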